This notebook works on creating some plots with ggplot2.

Load packages

# Set global chunk options (include blocks that throw errors)
knitr::opts_chunk$set(error = TRUE)

# For data wrangling
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins) # includes the "penguins" dataset
library(ggthemes) # include ggplot themes for pretty plotting

Playing with the penguins dataset

Do penguins with longer flippers weigh more or less than penguins with shorter flippers? What does the relationship look like?

Note > Note that it says tibble on top of this preview. In the tidyverse, we use special data frames called tibbles that you will learn more about soon.

Inspect the data

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

An alternative view that shows all columns (and they types) can use the glimpse method like so…

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

In Rstudio, you can also use View() to open an interactive viewer that opens the file in another tab…

Create a scatterplot of the data

Below, aes stands for “aesthetics.” Aesthetics are always set within the mapping parameter, but can consist of many things.

ggplot(
    data = penguins, # first variable is always the data
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

The plot above is blank because we have not told ggplot2 what to place any observations on the “canvas” yet.

To do that, we need to define a geom: the geometrical object that a plot uses to represent data. These are things like bars, lines, boxplots, etc.

ggplot(
    data = penguins, # first variable is always the data
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
) + geom_point() # add points!
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Alter the aesthetics of the data

Color the points

We can simply add another parameter into the aes mapping and to color points based on penguin species.

ggplot(
    data = penguins, # first variable is always the data
    mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) + geom_point() # add points!
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Add fit lines

To add fit lines, we must add another geometric object, a smoothed line. The way that this is done in ggplot2 is with the geom_smooth() object.

ggplot(
    data = penguins, # first variable is always the data
    mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) + 
    geom_point() + # Add points
    geom_smooth(method = "lm") # Add a smoothed line use a *l*inear *m*odel
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

How do we create a fit line for all data?

When you define aesthetic properties in the ggplot function at the global level, then all of these aesthetic properties are passed down to “lower-level” geom_ properties. However, we can define aes properties within specific geom_ properties so that we can apply the fit line to the entire dataset.

ggplot(
    data = penguins, # first variable is always the data
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
) + 
    geom_point(mapping = aes(color = species)) + # Add points, set color mapping here
    geom_smooth(method = "lm") # Add a smoothed line use a *l*inear *m*odel — now for the whole dataset
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Control geom_ shapes

We can again set this within the specific geometric property we want to alter.

ggplot(
    data = penguins, # first variable is always the data
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
) + 
    geom_point(mapping = aes(color = species, shape = species)) + # Now set different shapes for species
    geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Clean up labels, use color-friendly color scale

The labs function helps control the labels and the scale_color_colorblind() can be added at the end.

ggplot(
    data = penguins, # first variable is always the data
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
) + 
    geom_point(mapping = aes(color = species, shape = species)) + # Now set different shapes for species
    geom_smooth(method = "lm") +
    labs(
        title = "Body mass and flipper length",
        subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
        x = "Flipper length (mm)", y = "Body mass (g)",
        color = "Species", shape = "Species"
    ) +
    scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Exercises

1. How many rows are in penguins? How many columns?

nrow(penguins)
## [1] 344

2. What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.

?penguins

Definition: > a number denoting bill depth (millimeters)

3 Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

ggplot(
    data = penguins,
) +
    geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Doesn’t seem like there is much of a relationship at this level of analysis.

4. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

ggplot(
    data = penguins,
) +
    geom_point(mapping = aes(x = species, y = bill_depth_mm))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Seems like the Gentoo have noticeably smaller bill depths when compared to the other species. Bit more variance for the Adelie, as well.

5. Why does the following give an error and how would you fix it?

ggplot(data = penguins) + 
  geom_point()
## Error in `geom_point()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_point()` requires the following missing aesthetics: x and y.

This throws an error because we’ve not specified any aesthetics. ggplot does not know how to draw the points without instructions. We can fix this simply by telling it what to draw on the x and y axes.

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = sex, y = bill_length_mm))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

6. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

?geom_point

Definition > If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.

We can make the same plot as before and see that it no longer raises a warning…

ggplot(
    data = penguins,
) +
    geom_point(mapping = aes(x = species, y = bill_depth_mm), na.rm=TRUE)

7. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

?labs
ggplot(
    data = penguins,
) +
    geom_point(mapping = aes(x = species, y = bill_depth_mm), na.rm=TRUE) +
    labs(caption = "Data come from the `palmerpenguins` package.")

8. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

Picture from the website.
Picture from the website.

We want bill_depth_mm to control the color of the points.

To understand the different methods for the fit line, we run the below…

?geom_smooth

Looks like their is a loess smoothing method, which seems right…

ggplot(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
    geom_smooth(method = 'loess', na.rm=TRUE) + 
    geom_point(
        mapping = aes(color = bill_depth_mm), # We add color here to avoid error from geom_smooth
        na.rm=TRUE
    )
## `geom_smooth()` using formula = 'y ~ x'

9. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

Prediction > Scatterplot where x axis is the flipper length, the y axis is the body mass, and color identifies the island of the penguins. I am assuming that we’re going to have something similar to the scatterplot above where we have three lines that are related to the type of penguin (which I am assuming live in different geographically locations). There will be three different lines because the color mapping is made at the global level within the ggplot function. The se=FALSE portion seems like it will remove the error bands.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

What I got wrong and what I learned

I thought the lines would be linear by default, but it seems like ggplot applies the “loess” method by default—but only for smaller datasets—because they think it looks better.

Run the below to learn more…

?geom_smooth

From the method documentation > stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = “cs”) with method = “REML”. Somewhat anecdotally, loess gives a better appearance, but is O(n^2) in memory, so does not work for larger datasets.

10. Will these two graphs look different? Why/why not?

Answer > No. The reason is because the mapping/aes properties set by the ggplot() function are passed to all of the geometric properties at lower levels. So, because they are defined for both geometric properties in the second set of code, they will be identical.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Note on writing code…

Basically, including data and mapping in the gglplot call is very declarative but a bit much. Typically, people do not include this as we know what the first two variables are and code looks like the below…

ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Another approach uses the pipe syntax, which will be discussed more later.

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Visualizing Distributions

Categorical variables

The below simply sets the height of the bars based on how many observations occurred with each x value.

ggplot(penguins, aes(x = species)) +
    geom_bar()

If we want to order them by frequency, we can use a specific function for this, which is fct_infreq.

What this function means is that we first convert the variable to a factor and order the variables based on frequency. The language for the function infreq seems weird at first, but makes more sense when you know the other similar functions: > - fct_inorder(): by the order in which they first appear. > - fct_infreq(): by number of observations with each level (largest first) > - fct_inseq(): by numeric value of level. > > source

ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()

Numerical distributions

Histograms

ggplot(
    penguins, 
    aes(x = body_mass_g)) + # Place body mass on the x-axis
    geom_histogram(binwidth = 200) # Draw histrograms
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

Make sure to experiment with many binwidths!

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

Density plots

ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

Exercises

1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

ggplot(
    penguins,
    aes(y=species)
) +
    geom_bar()

ggplot(
    penguins,
    aes(y=fct_inorder(species))
) +
    geom_bar()

2. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

ggplot(penguins, aes(x = species)) +
    geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")

Putting them together makes what is going on here obvious…

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red", color='purple')

3. What does the bins argument in geom_histogram() do?

Controls the number of bins to use in the plot.

ggplot(
    penguins,
    aes(x=body_mass_g)
) +
    geom_histogram(bins=5)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(
    penguins,
    aes(x=body_mass_g)
) +
    geom_histogram(bins=2)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

From documentation > bins: Number of bins. Overridden by binwidth. Defaults to 30.

4. Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?

The below is the most interesting to me, as it shows the clear grouping around specific thresholds of importance to people. This seems to indicate the people want to get to nice clean carat threshold of, for example, .5, 1.0, 1.5, or 2.0.

ggplot(
    diamonds,
    aes(x=carat)
) +
    geom_histogram(binwidth = .01)

Visualizing relationships

This requires at least two variables…

Numerical and categorical

Making a boxplot…

ggplot(
    penguins, 
    aes(x = species, y = body_mass_g)
) +
    geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Density plots for each species…

ggplot(
    penguins, 
    aes(x = body_mass_g, color = species)
) +
  geom_density(linewidth = 1)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

Filling the density plot and controling the opacity with alpha

ggplot(
    penguins,
    aes(x = body_mass_g, color = species, fill = species)
) +
  geom_density(alpha = 0.5)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

Two categorical values

Here we are showing the number of each species on each island.

ggplot(
    penguins, 
    aes(x = island, fill = species)
) +
  geom_bar()

However, it is hard to compare the relative population of each species on their respective islands — for example, the proportion of Adelie on the Biscoe and Dream islands — because they have an unequal amount of penguins on each. We can normalize by the population of each island and calculate each species relative amount by using the position='fill' within geom_bar.

ggplot(
    penguins, 
    aes(x = island, fill = species)
) +
  geom_bar(position = "fill")

Two numerical variables

Scatter plots

We’ve seen these before…

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Three or more variables

We can really just keep piling things on…

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

However adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way, which is particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical.

Here “formula” is the name of the thing created by ~, not a synonym for “equation”.

# Build the basic framework. Remember aesthetics are passed down to geoms
ggplot(
    penguins, 
    aes(x = flipper_length_mm, y = body_mass_g)
) +
    # How we want to control the points
    geom_point(
        aes(color = species, shape = species)
    ) +
    # Create facets of each individual island
    facet_wrap(~island)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Exercises

1. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

Here are a few ways to check this…

  1. Summary of the tibble includes types on the left
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
  1. This gives similar information
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
  1. Apply the class function to all columns to output their class type.
sapply(mpg, class)
## manufacturer        model        displ         year          cyl        trans 
##  "character"  "character"    "numeric"    "integer"    "integer"  "character" 
##          drv          cty          hwy           fl        class 
##  "character"    "integer"    "integer"  "character"  "character"

2. Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

ggplot(
    mpg,
    aes(x = hwy, y = displ)
) +
    geom_point(
        aes(
            color = cty,
            size = cty, # These two are handled well
            #shape = cty # This throws an error because continuous variables cannot be mapped to shapes
        )
    )

3. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

You can kind of see this below but this was a badly formulated question.

ggplot(
    mpg,
    aes(x = hwy, y = displ, linewidth = cty)
) +
  geom_point(
    shape=21, # This makes the circles with no fill
  )

4. What happens if you map the same variable to multiple aesthetics?

What you’d expect.

ggplot(
    mpg,
    aes(x = hwy, y = displ, linewidth = cty, color = cty)
) +
  geom_point(
    shape=21, # This makes the circles with no fill
  )

5. Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

I think you can sort of see it already but their is a linear relationship between the two variables. Certainly more clear when you facet them.

ggplot(
    penguins,
    aes(x = bill_depth_mm, y = bill_length_mm, color = species)
) +
  geom_point(
    shape=21, # This makes the circles with no fill
  )
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(
    penguins,
    aes(x = bill_depth_mm, y = bill_length_mm, color = species)
) +
  geom_point(
    shape=21, # This makes the circles with no fill
  ) +
    facet_wrap(~species)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

6. Why does the following yield two separate legends? How would you fix it to combine the two legends?

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

The reason has to do with the labs() call, which only labels ONE of the two aesthetic properties. To combine them, we can simply label them both the same thing.

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species", shape = "Species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Note that if they do not match exactly, they will be again separated.

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species", shape = "species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

7. Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

What proportion of each species exists on each island? Which island has more Adelie species, Dream or Biscoe?

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

What proportion of each species are from each island?

Saving plots

To save plots, simply add the ggsave function at the end with a file name.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggsave(filename = "figures/penguin-plot.png")
## Saving 7 x 5 in image
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).